
Performance: build DocIdSetIterator in ArrayHitCounter to enable future optimizations #718

Merged
alexklibisz merged 15 commits into main from 160-better-counter on Aug 28, 2024

Conversation

@alexklibisz (Owner) commented on Aug 25, 2024

Related Issue

#715

Changes

I have some ideas for improvements to the ArrayHitCounter. Most of them are tightly coupled to the way the DocIdSetIterator is constructed, so I'd like to bring that functionality into the HitCounter. That's what this PR does.
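A rough sketch of the resulting interface shape (method names are illustrative, not the final API):

```java
import org.apache.lucene.search.DocIdSetIterator;

// Hedged sketch: the counter itself builds the DocIdSetIterator, rather
// than callers reading raw counts out of it. This lets an implementation
// like ArrayHitCounter specialize the construction later.
interface HitCounter {
  void increment(int docId);
  DocIdSetIterator docIdSetIterator(int minCount);
}
```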

Testing and Validation

Added new tests and re-ran benchmarks

@alexklibisz changed the title from "Performance: add DocIdSetIterator to ArrayHitCounter" to "Performance: build DocIdSetIterator in ArrayHitCounter" on Aug 26, 2024
@joancf commented on Aug 26, 2024

@alexklibisz I was going to open a PR to solve the issue the old HitCounter had of returning documents with zero hits. I see that the error is still present. The simplest way to solve it is to skip documents whose count is zero when building the iterator; see the sketch below.
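A minimal sketch of that guard, assuming a counts array indexed by docId (hypothetical names, not the actual elastiknn code):

```java
import org.apache.lucene.search.DocIdSetIterator;

// Hypothetical sketch: an iterator that only returns docs with a non-zero
// hit count. `counts` stands in for ArrayHitCounter's internal array.
final class NonZeroHitsIterator {
  static DocIdSetIterator of(short[] counts) {
    return new DocIdSetIterator() {
      private int doc = -1;

      @Override public int docID() { return doc; }

      @Override public int nextDoc() {
        doc++;
        // Skip documents with zero hits so they are never returned.
        while (doc < counts.length && counts[doc] == 0) doc++;
        if (doc >= counts.length) doc = NO_MORE_DOCS;
        return doc;
      }

      @Override public int advance(int target) {
        doc = target - 1; // nextDoc() increments and then skips zero counts
        return nextDoc();
      }

      @Override public long cost() { return counts.length; }
    };
  }
}
```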

I also analyzed your code a bit. Let's compute the time if a shard has D documents, H of which are a hit in at least one of the L hash tables. The current version has a complexity of O(H) to fill the table and O(D) in time and space to find the top ones.

I also thought that using a hash map of docId -> frequency could be better, though the numbers are difficult to compare. With a hashmap, putting the data into it takes O(H). Then an array of lists (List[]) of size max_freq can be built to bucket the docs by frequency, which also takes O(H); see the sketch below. The time and space needed depend on the values of L and K: for example, K=10 selects about 0.1% of the documents per table, and with L=100 this implies H is less than 10% of the documents. So I think using a hashmap can be smaller in both space and time.
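A hedged sketch of that hashmap-plus-buckets idea (illustrative names, not a drop-in for elastiknn):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: count hits in a HashMap (O(H) space instead of O(D)), then
// bucket docIds by frequency so the top candidates can be read from the
// highest buckets without scanning all D documents.
final class HashMapHitCounterSketch {
  private final Map<Integer, Integer> counts = new HashMap<>();
  private int maxFreq = 0;

  void increment(int docId) {
    int c = counts.merge(docId, 1, Integer::sum);
    if (c > maxFreq) maxFreq = c;
  }

  // buckets[f] holds all docIds hit exactly f times; filling is O(H).
  List<Integer>[] bucketsByFrequency() {
    @SuppressWarnings("unchecked")
    List<Integer>[] buckets = new List[maxFreq + 1];
    for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
      int f = e.getValue();
      if (buckets[f] == null) buckets[f] = new ArrayList<>();
      buckets[f].add(e.getKey());
    }
    return buckets;
  }
}
```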

@joancf commented on Aug 26, 2024

Let me add a further comment about L and K.

K=1 implies a division of the space into 2 parts by a random plane, so we should expect 50% of the documents on each side. If we use K=1 and L=m tables, then the probability of a document being a positive hit in at least one table is 1 - (probability of being a negative in all L tables), i.e. 1 - 0.5^m, which is already 99.9% for L >= 10. So with K=1 and L=99 we are considering almost all the documents.

With K=10, the probability of being a hit in a single hash table is 0.5^10, so the probability of being a negative in one table is (2^10 - 1)/2^10. In general, the probability of a document being a hit in at least one hash table is

1 - ((2^K - 1) / 2^K)^L

Here are the computed values of this expression for K from 1 to 10 and L from 10 to 100, in steps of 10:

| K \ L | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
|-------|----|----|----|----|----|----|----|----|----|-----|
| 1 | 0.999023 | 0.999999 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 0.943686 | 0.996829 | 0.999821 | 0.99999 | 0.999999 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | 0.736924 | 0.930791 | 0.981793 | 0.99521 | 0.99874 | 0.999669 | 0.999913 | 0.999977 | 0.999994 | 0.999998 |
| 4 | 0.47554 | 0.724941 | 0.855743 | 0.924343 | 0.960321 | 0.97919 | 0.989086 | 0.994276 | 0.996998 | 0.998426 |
| 5 | 0.272024 | 0.470051 | 0.61421 | 0.719154 | 0.795551 | 0.851166 | 0.891652 | 0.921125 | 0.942581 | 0.9582 |
| 6 | 0.145709 | 0.270187 | 0.376528 | 0.467373 | 0.544982 | 0.611282 | 0.667922 | 0.716309 | 0.757645 | 0.792958 |
| 7 | 0.075435 | 0.145179 | 0.209663 | 0.269282 | 0.324403 | 0.375367 | 0.422486 | 0.466051 | 0.506329 | 0.543569 |
| 8 | 0.038383 | 0.075293 | 0.110786 | 0.144916 | 0.177737 | 0.209298 | 0.239647 | 0.268832 | 0.296896 | 0.323884 |
| 9 | 0.01936 | 0.038346 | 0.056964 | 0.075222 | 0.093126 | 0.110684 | 0.127901 | 0.144785 | 0.161343 | 0.17758 |
| 10 | 0.009723 | 0.019351 | 0.028886 | 0.038328 | 0.047678 | 0.056937 | 0.066106 | 0.075186 | 0.084178 | 0.093083 |

These values show how the hit probability changes as K and L increase.
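For reference, a small snippet (hypothetical, just to reproduce the table) computing these values:

```java
// Probability that a document collides with the query in at least one of
// L hash tables with K bits each, assuming each bit matches independently
// with probability 1/2.
public class HitProbability {
  public static void main(String[] args) {
    for (int k = 1; k <= 10; k++) {
      for (int l = 10; l <= 100; l += 10) {
        double pTable = 1.0 / (1 << k);            // match one table: (1/2)^K
        double pHit = 1 - Math.pow(1 - pTable, l); // at least one of L tables
        System.out.printf("K=%d L=%d p=%.6f%n", k, l, pHit);
      }
    }
  }
}
```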

So K=10 and L=100 selects about 1/10 of the documents, but L=100 with K < 6 selects most of the documents. This impacts your benchmarks in #160 (comment): since you always use K=4 and L=100, roughly 99% of the documents are a hit!

@alexklibisz marked this pull request as ready for review on August 28, 2024.
@alexklibisz changed the title from "Performance: build DocIdSetIterator in ArrayHitCounter" to "Performance: build DocIdSetIterator in ArrayHitCounter to enable future optimizations" on Aug 28, 2024.
@alexklibisz merged commit c5a8e21 into main on Aug 28, 2024 (5 checks passed).
@alexklibisz deleted the 160-better-counter branch on August 28, 2024.